The network data for this project come from the Longitudinal Employer-Household Dynamics (LEHD) program of the Center for Economic Studies at the U.S. Census Bureau. LEHD maintains the LEHD Origin-Destination Employment Statistics (LODES), which record the number of people who live in one census block and work in another for each year from 2002 to 2022. The complete data represent a longitudinal, directed, weighted network with millions of nodes.
For this project, I’m only using the LODES for one city—Chicago—and
one year—2022. The 01-fetch.R script fetches these data
from the LODES FTP server (https://lehd.ces.census.gov/data/lodes/LODES8/).
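Because LODES is recorded at the census block level while the rest of this analysis works with tracts, the block-level flows need to be rolled up. The sketch below shows one way to do that in base R on a few made-up rows. It assumes the origin-destination file uses the columns h_geocode (home block), w_geocode (work block), and S000 (total jobs), and that the geocodes are read as character strings; it is an illustration, not the contents of 01-fetch.R.

```r
# Toy rows shaped like a LODES origin-destination file (values are made up).
# Assumed columns: w_geocode = workplace block, h_geocode = home block,
# S000 = total number of jobs for that block pair.
od_blocks <- data.frame(
  w_geocode = c("170318391001000", "170318391001001", "170314101001002"),
  h_geocode = c("170314101001000", "170314101001005", "170314101001000"),
  S000 = c(3L, 2L, 1L),
  stringsAsFactors = FALSE
)

# A tract GEOID is the first 11 characters of the 15-character block GEOID
od_blocks$w_tract <- substr(od_blocks$w_geocode, 1, 11)
od_blocks$h_tract <- substr(od_blocks$h_geocode, 1, 11)

# Sum block-level flows within each (home tract, work tract) pair
od_tracts <- aggregate(S000 ~ h_tract + w_tract, data = od_blocks, FUN = sum)
print(od_tracts)
```

Keeping the geocodes as character strings avoids losing precision or leading zeros when the 15-digit codes are parsed.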
The 01-fetch.R script fetches two additional data
sources that are used in my analysis and visualizations. First, we need
the census tract geographic shapes, which are provided by the TIGER API.
Second, since my analysis occurs at a
neighborhood level, we need a way to crosswalk from census tracts to
Chicago community areas (CCAs). UChicago’s
Spatial Data Resources provides a file with the boundaries of all
Chicago community areas that we can use for this purpose.
Chicago has 77 community areas as of 2025.
After 01-fetch.R saves these three data sets to files in
the data/ directory, the 02-tracts.R and
03-community-areas.R scripts perform the following preprocessing
steps:
"OUTSIDE CHICAGO".
Finally, the processed data are saved to
data/ccas.geojson and data/cca_flows.csv.
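To make the tract-to-community-area step concrete, here is a minimal base R sketch. It assumes a tract-to-CCA lookup table has already been derived from the boundary file, and that any tract without a match is assigned to a single "OUTSIDE CHICAGO" node. The crosswalk table, the toy flows, and the to_cca() helper are all hypothetical and are not taken from 02-tracts.R or 03-community-areas.R.

```r
# Hypothetical crosswalk from tract GEOID to community area name
crosswalk <- data.frame(
  tract = c("17031410100", "17031839100"),
  cca = c("Hyde Park", "Loop"),
  stringsAsFactors = FALSE
)

# Toy tract-level flows; the last workplace tract is not in the crosswalk
tract_flows <- data.frame(
  h_tract = c("17031410100", "17031410100", "17031410100"),
  w_tract = c("17031410100", "17031839100", "17043840000"),
  n = c(10L, 25L, 4L),
  stringsAsFactors = FALSE
)

# Look up each tract's community area; unmatched tracts fall outside the city
to_cca <- function(tract) {
  cca <- crosswalk$cca[match(tract, crosswalk$tract)]
  ifelse(is.na(cca), "OUTSIDE CHICAGO", cca)
}

tract_flows$from <- to_cca(tract_flows$h_tract)  # home community area
tract_flows$to <- to_cca(tract_flows$w_tract)    # work community area

# Sum tract-level flows within each (from, to) pair of community areas
cca_flows_toy <- aggregate(n ~ from + to, data = tract_flows, FUN = sum)
print(cca_flows_toy)
```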
One simple result we can draw from this dataset is where the residents of any given community area work. For example, below we compare the commuting flows for residents of Hyde Park and Woodlawn. The two are adjacent, but because Hyde Park is home to the University of Chicago, most of its residents work there. Woodlawn, on the other hand, has a shortage of local jobs, so most Woodlawn residents commute to the Loop.
When we load the CCA commuting flows into igraph, here
is what we get:
## IGRAPH 65b2c6a DNW- 78 6084 --
## + attr: name (v/c), w_from_chicago (v/n), w_total (v/n), work_in_same
## | (v/n), h_in_chicago (v/n), h_avg_distance (v/n), h_median_distance
## | (v/n), h_total (v/n), num (v/n), distance_from_loop (v/n), geometry
## | (v/x), n (e/n), distance (e/n), weight (e/n)
Notice that the number of edges (6,084) is exactly the number of nodes (78) squared. This is because all dyads are included in this network, even those with zero commuting flow. The maximum number of edges is \(n^2\) rather than \(n(n-1)\) because self-edges are allowed; the familiar \(n(n-1)/2\) applies only to undirected networks without self-edges.
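The edge count can be checked with quick arithmetic. The snippet below takes the node and edge counts from the igraph summary above and compares them with the number of ordered pairs in a directed network, with and without self-edges. The 78th node is presumably the "OUTSIDE CHICAGO" node that sits alongside the 77 community areas.

```r
n_nodes <- 78    # 77 community areas plus, presumably, "OUTSIDE CHICAGO"
n_edges <- 6084  # edge count reported in the igraph summary

with_self_edges <- n_nodes^2                   # every ordered pair, including i -> i
without_self_edges <- n_nodes * (n_nodes - 1)  # ordered pairs with i != j

# The reported edge count matches the count that includes self-edges
stopifnot(n_edges == with_self_edges)
print(c(with_self_edges = with_self_edges, without_self_edges = without_self_edges))
```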
Here is a plot of the network with nodes colored according to their distance from the Loop:
The black dot in the middle is the Loop (downtown Chicago). The out-of-place lighter blue dot near the central part of the network is O’Hare, which employs a lot of people despite being far from downtown.
Weighted in-degree (the number of people who work in this community area) is recorded in the
w_total column:
## # A tibble: 6 × 2
## name w_total
## <chr> <int>
## 1 Loop 442777
## 2 Near North Side 201444
## 3 Near West Side 145641
## 4 Ohare 57408
## 5 West Town 32620
## 6 Hyde Park 27029
The community areas with the most jobs are those in the central city (the Loop, Near North Side, Near West Side, and West Town) or those that contain a major employer (O’Hare and Hyde Park).
Weighted out-degree (the number of workers who live in this community area) is
recorded in the h_total column:
## # A tibble: 10 × 2
## name h_total
## <chr> <int>
## 1 Lake View 53699
## 2 Near North Side 53596
## 3 West Town 46419
## 4 Logan Square 38107
## 5 Austin 35406
## 6 Lincoln Park 33749
## 7 Near West Side 31099
## 8 Belmont Cragin 28989
## 9 West Ridge 28168
## 10 Uptown 28028
Unlike in-degree, out-degree is not dominated by a few extreme outliers. However, notice the clear trend on the map: North Side neighborhoods tend to have more working residents than South Side neighborhoods, where unemployment tends to be higher.
Not surprisingly, neighborhoods with more working residents tend to also have more jobs.
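As a rough illustration of how totals like w_total and h_total relate to the flow table, the sketch below sums a toy edge list by destination and by origin. The node names and counts are made up, and it assumes (as in cca_flows) that `from` is where people live and `to` is where they work. It is not the code that produced the tables above.

```r
# Toy commuting flows: n people live in `from` and work in `to`
flows <- data.frame(
  from = c("A", "A", "B", "B", "C"),
  to   = c("A", "B", "A", "C", "A"),
  n    = c(5, 2, 7, 1, 4)
)

# Weighted out-degree: working residents of each area (sum over destinations)
h_total <- tapply(flows$n, flows$from, sum)

# Weighted in-degree: jobs located in each area (sum over origins)
w_total <- tapply(flows$n, flows$to, sum)

print(h_total)
print(w_total)
```

Both totals count the same commuters, once by home and once by workplace, so they sum to the same number across the whole network.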
## # A tibble: 10 × 2
## name eig_centrality
## <chr> <dbl>
## 1 Loop 1
## 2 Near North Side 0.847
## 3 Near West Side 0.478
## 4 Lake View 0.462
## 5 West Town 0.366
## 6 Lincoln Park 0.306
## 7 Logan Square 0.260
## 8 Uptown 0.201
## 9 Edgewater 0.167
## 10 Austin 0.154
The node with the highest eigenvector centrality is the Loop, followed by the neighborhoods north and west of the Loop. South Side community areas uniformly have very low eigenvector centrality.
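For intuition about what eigenvector centrality measures on a weighted, directed network, here is a small power-iteration sketch in base R. It assumes that a node's score is proportional to the weighted sum of the scores of the nodes sending flow to it, and it rescales so the top node scores 1, as in the table above. The 3-node matrix and the power_iteration() helper are hypothetical; the table itself was presumably produced with a library routine rather than this code.

```r
# Toy weighted adjacency matrix: A[i, j] is the flow from node i to node j
A <- matrix(
  c(0, 8, 1,
    3, 0, 1,
    2, 6, 0),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("A", "B", "C"), c("A", "B", "C"))
)

# Repeatedly replace each node's score with the weighted sum of the scores
# of the nodes that send flow to it, rescaling so the largest score is 1
power_iteration <- function(A, tol = 1e-10, max_iter = 1000) {
  x <- rep(1, nrow(A))
  for (i in seq_len(max_iter)) {
    x_new <- as.vector(crossprod(A, x))  # same as t(A) %*% x
    x_new <- x_new / max(x_new)
    converged <- max(abs(x_new - x)) < tol
    x <- x_new
    if (converged) break
  }
  setNames(x, rownames(A))
}

centrality <- power_iteration(A)
print(round(centrality, 3))
```

In this toy example node B receives the largest inflows, so it ends up with the top score, much as the Loop does in the real network.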
The plot below highlights the relationship between eigenvector centrality and geographic location.
This plot shows the number of local jobs per working resident in each community area. In neighborhoods with the largest employment, like O’Hare and the Loop, there are many more jobs than working residents. By contrast, many neighborhoods have nearly ten times as many working residents as jobs, so almost everyone must commute.
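The ratio can be worked out by hand for the three community areas that appear in both of the top-ten tables above. The snippet below uses only the w_total and h_total values printed there; the jobs_per_resident column name is my own label.

```r
# Values copied from the w_total and h_total tables above
jobs <- data.frame(
  name = c("Near North Side", "Near West Side", "West Town"),
  w_total = c(201444, 145641, 32620),  # jobs located in the area
  h_total = c(53596, 31099, 46419)     # working residents of the area
)

# Local jobs per working resident
jobs$jobs_per_resident <- jobs$w_total / jobs$h_total
print(jobs)
```

Near North Side and Near West Side each have several jobs per working resident, while West Town has fewer jobs than working residents.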
Intuitively, we would expect that commuting is higher between neighborhoods that are closer together.
Modeling this dataset with an exponential random graph model (ERGM) would help us understand which factors determine where people work. However, the ERGMs we learned about in class work only with unweighted networks. Krivitsky (2012) extended the ERGM framework to networks whose edge values represent counts, which is what we have here.
library(tidyverse)   # dplyr, tidyr, tibble
library(tidygraph)   # %N>%
library(intergraph)  # asNetwork()
library(ergm)        # ergm()
library(ergm.count)  # Poisson reference measure for count-valued ERGMs

chicago_only <- cca_graph %N>%
  filter(name != "OUTSIDE CHICAGO")
net <- asNetwork(chicago_only)

# Add node covariates
net %v% "residents_log" <- log(chicago_only %N>% pull(h_in_chicago))
net %v% "workers_log" <- log(chicago_only %N>% pull(w_from_chicago))

# Network covariate: geographic distances
# Make a full distance matrix. Recall that all dyads are represented in
# cca_flows, not just non-zero edges.
D <- cca_flows %>%
  filter(from != "OUTSIDE CHICAGO", to != "OUTSIDE CHICAGO") %>%
  select(from, to, distance) %>%
  pivot_wider(names_from = to, values_from = distance) %>%
  column_to_rownames("from") %>%
  as.matrix()
# Align rows and columns with the vertex order of `net`. This assumes
# asNetwork() stored the igraph `name` attribute as "vertex.names".
vertex_order <- net %v% "vertex.names"
D <- D[vertex_order, vertex_order]

fit1 <- ergm(
  net ~
    sum +                        # baseline intensity
    nonzero +                    # sparsity
    nodeocov("residents_log") +  # origin size effect
    nodeicov("workers_log") +    # destination size effect
    edgecov(D),                  # deterrence by distance
  reference = ~Poisson,
  response = "n"
)
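For reference, the model specified above can be written out as follows. This is my reading of the count-valued ERGM of Krivitsky (2012) with a Poisson reference measure, not output from the fitted model. The symbols are introduced here for illustration: \(y_{ij}\) is the number of commuters from community area \(i\) to \(j\), \(r_i\) and \(w_j\) stand for h_in_chicago and w_from_chicago, \(D_{ij}\) is the distance covariate, and \(\kappa(\theta)\) is the normalizing constant.

\[
\Pr_\theta(Y = y) = \frac{h(y)\,\exp\{\theta^\top g(y)\}}{\kappa(\theta)},
\qquad
h(y) = \prod_{(i,j)} \frac{1}{y_{ij}!}
\]

\[
\theta^\top g(y) =
\theta_1 \sum_{(i,j)} y_{ij}
+ \theta_2 \sum_{(i,j)} \mathbb{1}[y_{ij} > 0]
+ \theta_3 \sum_{(i,j)} y_{ij} \log r_i
+ \theta_4 \sum_{(i,j)} y_{ij} \log w_j
+ \theta_5 \sum_{(i,j)} y_{ij} D_{ij}
\]

The five terms correspond, in order, to sum, nonzero, nodeocov("residents_log"), nodeicov("workers_log"), and edgecov(D). Under this reading, a negative \(\theta_5\) would indicate that commuting falls off with distance.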
Research question: Do commuting patterns at the start of the 21st century help explain why some Chicago neighborhoods experienced greater economic improvement over the following two decades?
Dependent variable: Change in neighborhood SES 2002-2022. Could be measured by:
Independent variable: Commuting network measures. Could include:
Questions and concerns:
Next steps: